Clustering with scikit-learn

In this notebook, we will learn how to perform k-means clustering using scikit-learn in Python.

We will use cluster analysis to generate a big-picture model of the weather at a local station using minute-granularity data. The dataset contains over 1.5 million records. How do we create 12 clusters out of them?

NOTE: The dataset we will use is in a large CSV file called minute_weather.csv. Please download it into the weather directory in your Week-7-MachineLearning folder. The download link is: https://drive.google.com/open?id=0B8iiZ7pSaSFZb3ItQ1l4LWRMTjg


Importing the Necessary Libraries


In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
#import utils
import pandas as pd
import numpy as np
from itertools import cycle, islice
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

%matplotlib inline


Creating a Pandas DataFrame from a CSV file


In [2]:
data = pd.read_csv('./weather/minute_weather.csv')

Minute Weather Data Description


The minute weather dataset comes from the same source as the daily weather dataset that we used in the decision-tree-based classifier notebook. The main difference between the two is that the minute weather dataset contains raw sensor measurements captured at one-minute intervals, whereas the daily weather dataset contained processed and well-curated data. The data is in the file minute_weather.csv, which is a comma-separated file.

As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:

  • rowID: unique number for each row (Unit: NA)
  • hpwren_timestamp: timestamp of measure (Unit: year-month-day hour:minute:second)
  • air_pressure: air pressure measured at the timestamp (Unit: hectopascals)
  • air_temp: air temperature measured at the timestamp (Unit: degrees Fahrenheit)
  • avg_wind_direction: wind direction averaged over the minute before the timestamp (Unit: degrees, with 0 meaning the wind is coming from the North, and increasing clockwise)
  • avg_wind_speed: wind speed averaged over the minute before the timestamp (Unit: meters per second)
  • max_wind_direction: highest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
  • max_wind_speed: highest wind speed in the minute before the timestamp (Unit: meters per second)
  • min_wind_direction: smallest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
  • min_wind_speed: smallest wind speed in the minute before the timestamp (Unit: meters per second)
  • rain_accumulation: amount of accumulated rain measured at the timestamp (Unit: millimeters)
  • rain_duration: length of time rain has fallen as measured at the timestamp (Unit: seconds)
  • relative_humidity: relative humidity measured at the timestamp (Unit: percent)
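Note that hpwren_timestamp arrives as a plain string; parsing it into proper datetimes enables time-based slicing and resampling later. A minimal sketch on a tiny synthetic frame (hypothetical values mirroring a few of the columns above, not taken from the real CSV):

```python
import pandas as pd

# A tiny synthetic frame mirroring a few of the columns above
# (the real minute_weather.csv is loaded later in this notebook).
toy = pd.DataFrame({
    'rowID': [0, 1, 2],
    'hpwren_timestamp': ['2011-09-10 00:00:49',
                         '2011-09-10 00:01:49',
                         '2011-09-10 00:02:49'],
    'air_temp': [64.76, 63.86, 64.22],
})

# Parse the string column into datetime64 values.
toy['hpwren_timestamp'] = pd.to_datetime(toy['hpwren_timestamp'])
print(toy.dtypes)
```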

In [3]:
data.shape


Out[3]:
(1587257, 13)

In [4]:
data.head()


Out[4]:
rowID hpwren_timestamp air_pressure air_temp avg_wind_direction avg_wind_speed max_wind_direction max_wind_speed min_wind_direction min_wind_speed rain_accumulation rain_duration relative_humidity
0 0 2011-09-10 00:00:49 912.3 64.76 97.0 1.2 106.0 1.6 85.0 1.0 NaN NaN 60.5
1 1 2011-09-10 00:01:49 912.3 63.86 161.0 0.8 215.0 1.5 43.0 0.2 0.0 0.0 39.9
2 2 2011-09-10 00:02:49 912.3 64.22 77.0 0.7 143.0 1.2 324.0 0.3 0.0 0.0 43.0
3 3 2011-09-10 00:03:49 912.3 64.40 89.0 1.2 112.0 1.6 12.0 0.7 0.0 0.0 49.5
4 4 2011-09-10 00:04:49 912.3 64.40 185.0 0.4 260.0 1.0 100.0 0.1 0.0 0.0 58.8


Data Sampling

The dataset has lots of rows, so let us downsample by taking every 10th row.


In [5]:
sampled_df = data[(data['rowID'] % 10) == 0]
sampled_df.shape


Out[5]:
(158726, 13)
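The modulo filter above keeps rows whose rowID is a multiple of 10. On a toy frame where rowID equals the positional index, positional slicing with iloc gives the same result; on the real data this equivalence holds only while the rowIDs are consecutive and aligned with row positions, which is an assumption of this sketch:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather frame: 100 rows with a rowID column.
df = pd.DataFrame({'rowID': np.arange(100),
                   'air_temp': np.random.rand(100)})

# Keep rows whose rowID is a multiple of 10, as done above.
every_tenth = df[(df['rowID'] % 10) == 0]

# Positional slicing: every 10th row by position.
also_every_tenth = df.iloc[::10]

print(every_tenth.shape, also_every_tenth.shape)
```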


Statistics


In [6]:
sampled_df.describe().transpose()


Out[6]:
count mean std min 25% 50% 75% max
rowID 158726.0 793625.000000 458203.937509 0.00 396812.5 793625.00 1190437.50 1587250.00
air_pressure 158726.0 916.830161 3.051717 905.00 914.8 916.70 918.70 929.50
air_temp 158726.0 61.851589 11.833569 31.64 52.7 62.24 70.88 99.50
avg_wind_direction 158680.0 162.156100 95.278201 0.00 62.0 182.00 217.00 359.00
avg_wind_speed 158680.0 2.775215 2.057624 0.00 1.3 2.20 3.80 31.90
max_wind_direction 158680.0 163.462144 92.452139 0.00 68.0 187.00 223.00 359.00
max_wind_speed 158680.0 3.400558 2.418802 0.10 1.6 2.70 4.60 36.00
min_wind_direction 158680.0 166.774017 97.441109 0.00 76.0 180.00 212.00 359.00
min_wind_speed 158680.0 2.134664 1.742113 0.00 0.8 1.60 3.00 31.60
rain_accumulation 158725.0 0.000318 0.011236 0.00 0.0 0.00 0.00 3.12
rain_duration 158725.0 0.409627 8.665523 0.00 0.0 0.00 0.00 2960.00
relative_humidity 158726.0 47.609470 26.214409 0.90 24.7 44.70 68.00 93.00

In [7]:
sampled_df[sampled_df['rain_accumulation'] == 0].shape


Out[7]:
(157812, 13)

In [8]:
sampled_df[sampled_df['rain_duration'] == 0].shape


Out[8]:
(157237, 13)


Drop the rain_accumulation and rain_duration Columns, Then Remove Rows with Missing Values


In [9]:
del sampled_df['rain_accumulation']
del sampled_df['rain_duration']

In [10]:
rows_before = sampled_df.shape[0]
sampled_df = sampled_df.dropna()
rows_after = sampled_df.shape[0]


How many rows did we drop?


In [11]:
rows_before - rows_after


Out[11]:
46
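The same before/after bookkeeping can be reproduced on a small synthetic frame (hypothetical values, not taken from the weather data), confirming that dropna removes every row containing at least one NaN:

```python
import numpy as np
import pandas as pd

# Two rows each contain one missing value.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                    'b': [5.0, 6.0, np.nan, 8.0]})

rows_before = toy.shape[0]
toy = toy.dropna()          # drop any row with at least one NaN
rows_after = toy.shape[0]

print(rows_before - rows_after)   # number of rows removed
```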

In [12]:
sampled_df.columns


Out[12]:
Index(['rowID', 'hpwren_timestamp', 'air_pressure', 'air_temp',
       'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction',
       'max_wind_speed', 'min_wind_direction', 'min_wind_speed',
       'relative_humidity'],
      dtype='object')


Select Features of Interest for Clustering


In [13]:
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction', 
        'max_wind_speed','relative_humidity']

In [14]:
select_df = sampled_df[features]

In [15]:
select_df.columns


Out[15]:
Index(['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed',
       'max_wind_direction', 'max_wind_speed', 'relative_humidity'],
      dtype='object')

In [16]:
select_df


Out[16]:
air_pressure air_temp avg_wind_direction avg_wind_speed max_wind_direction max_wind_speed relative_humidity
0 912.3 64.76 97.0 1.2 106.0 1.6 60.5
10 912.3 62.24 144.0 1.2 167.0 1.8 38.5
20 912.2 63.32 100.0 2.0 122.0 2.5 58.3
30 912.2 62.60 91.0 2.0 103.0 2.4 57.9
40 912.2 64.04 81.0 2.6 88.0 2.9 57.4
50 912.1 63.68 102.0 1.2 119.0 1.5 51.4
60 912.0 64.04 83.0 0.7 101.0 0.9 51.4
70 911.9 64.22 82.0 2.0 97.0 2.4 62.2
80 911.9 61.70 67.0 3.3 70.0 3.5 71.5
90 911.9 61.34 67.0 3.6 75.0 4.2 72.5
100 911.8 62.96 95.0 2.3 106.0 2.5 63.9
110 911.8 64.22 83.0 2.1 88.0 2.5 59.1
120 911.8 63.86 68.0 2.1 76.0 2.4 63.5
130 911.6 64.40 156.0 0.5 203.0 0.7 50.4
140 911.5 65.30 85.0 2.2 92.0 2.5 58.0
150 911.4 64.58 154.0 1.3 176.0 2.1 50.2
160 911.4 65.48 154.0 0.9 208.0 1.9 46.2
170 911.5 65.66 95.0 1.1 109.0 1.6 45.2
180 911.4 65.66 155.0 1.1 167.0 1.6 42.8
190 911.4 67.10 157.0 1.2 172.0 1.6 36.8
200 911.4 68.00 53.0 0.3 69.0 0.5 33.4
210 911.3 67.64 167.0 1.5 196.0 2.2 34.4
220 911.4 67.82 4.0 0.6 25.0 0.7 34.2
230 911.4 66.74 172.0 1.3 192.0 1.9 37.8
240 911.4 66.56 39.0 0.2 145.0 0.3 41.6
250 911.4 65.66 56.0 1.9 67.0 2.2 51.8
260 911.5 65.66 74.0 0.8 101.0 1.2 41.1
270 911.4 66.92 147.0 0.9 174.0 1.1 36.0
280 911.3 64.76 73.0 1.0 82.0 1.2 43.3
290 911.3 64.94 164.0 1.3 176.0 1.7 43.0
... ... ... ... ... ... ... ...
1586960 914.7 76.46 247.0 0.6 264.0 0.7 43.4
1586970 914.8 76.28 208.0 0.7 216.0 0.9 43.7
1586980 914.8 76.10 209.0 0.7 216.0 0.9 43.9
1586990 914.9 76.28 339.0 0.5 350.0 0.7 43.4
1587000 914.9 75.92 344.0 0.4 352.0 0.6 43.9
1587010 915.0 75.56 323.0 0.3 348.0 0.5 45.5
1587020 915.1 75.56 324.0 1.1 347.0 1.5 46.0
1587030 915.1 75.74 1.0 1.3 13.0 1.7 45.8
1587040 915.2 75.38 355.0 0.9 1.0 1.1 46.1
1587050 915.3 75.38 359.0 1.4 11.0 1.5 45.8
1587060 915.4 75.38 11.0 1.1 21.0 1.3 45.7
1587070 915.5 75.38 13.0 1.4 24.0 1.6 46.6
1587080 915.6 75.20 18.0 1.0 24.0 1.2 46.5
1587090 915.6 75.20 356.0 1.7 1.0 1.9 47.2
1587100 915.7 75.38 13.0 1.5 24.0 1.7 46.7
1587110 915.7 75.02 19.0 1.2 28.0 1.4 46.7
1587120 915.7 74.84 25.0 1.4 35.0 1.6 46.5
1587130 915.8 74.84 23.0 1.3 30.0 1.5 46.9
1587140 915.8 74.84 32.0 1.4 41.0 1.7 45.5
1587150 915.8 75.20 23.0 1.1 31.0 1.4 45.7
1587160 915.8 75.38 16.0 1.2 28.0 1.5 46.3
1587170 915.7 75.38 347.0 1.2 353.0 1.4 48.1
1587180 915.8 75.74 326.0 1.2 337.0 1.6 48.3
1587190 915.9 75.92 289.0 0.7 309.0 0.9 48.1
1587200 915.9 75.74 335.0 0.9 348.0 1.1 47.8
1587210 915.9 75.56 330.0 1.0 341.0 1.3 47.8
1587220 915.9 75.56 330.0 1.1 341.0 1.4 48.0
1587230 915.9 75.56 344.0 1.4 352.0 1.7 48.0
1587240 915.9 75.20 359.0 1.3 9.0 1.6 46.3
1587250 915.9 74.84 6.0 1.5 20.0 1.9 46.1

158680 rows × 7 columns


Scale the Features using StandardScaler


In [17]:
X = StandardScaler().fit_transform(select_df)
X


Out[17]:
array([[-1.48456281,  0.24544455, -0.68385323, ..., -0.62153592,
        -0.74440309,  0.49233835],
       [-1.48456281,  0.03247142, -0.19055941, ...,  0.03826701,
        -0.66171726, -0.34710804],
       [-1.51733167,  0.12374562, -0.65236639, ..., -0.44847286,
        -0.37231683,  0.40839371],
       ..., 
       [-0.30488381,  1.15818654,  1.90856325, ...,  2.0393087 ,
        -0.70306017,  0.01538018],
       [-0.30488381,  1.12776181,  2.06599745, ..., -1.67073075,
        -0.74440309, -0.04948614],
       [-0.30488381,  1.09733708, -1.63895404, ..., -1.55174989,
        -0.62037434, -0.05711747]])
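Standardization matters for k-means because the algorithm uses Euclidean distance: without it, a feature with a large numeric range (such as air_pressure) would dominate the clustering. A quick check on synthetic data that StandardScaler leaves each column with zero mean and unit standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix (5 samples, 3 features) standing in for select_df.
rng = np.random.RandomState(0)
M = rng.normal(loc=10.0, scale=5.0, size=(5, 3))

Xs = StandardScaler().fit_transform(M)

# Each column now has (approximately) zero mean and unit standard
# deviation, so no single feature dominates the distance metric.
print(Xs.mean(axis=0), Xs.std(axis=0))
```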


Use k-Means Clustering


In [18]:
kmeans = KMeans(n_clusters=12)
model = kmeans.fit(X)
print("model\n", model)


model
 KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=12, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
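The choice of 12 clusters here is fixed up front. One common heuristic for choosing n_clusters is the elbow method: fit k-means for a range of k and watch where the inertia (within-cluster sum of squares, exposed as inertia_) stops dropping sharply. A minimal sketch on synthetic blobs (not the weather data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three obvious groups (a stand-in for X).
rng = np.random.RandomState(42)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                   for c in [(0, 0), (5, 5), (0, 5)]])

# Inertia drops as k grows; the "elbow" where it flattens out is a
# common heuristic for picking n_clusters.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(blobs).inertia_
    for k in range(1, 7)
}
for k in sorted(inertias):
    print(k, round(inertias[k], 1))
```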


What are the centers of the 12 clusters we formed?


In [22]:
centers = model.cluster_centers_
centers


Out[22]:
array([[ 0.23414265,  0.32088768,  1.88794018, -0.65174648, -1.55179078,
        -0.57663111, -0.28415129],
       [-0.21101616,  0.63388782,  0.40861282,  0.73377418,  0.51681119,
         0.67191387, -0.15134073],
       [-0.6969227 ,  0.54216256,  0.17702903, -0.5840737 ,  0.34628434,
        -0.59745567, -0.11354633],
       [-1.18235195, -0.86948308,  0.4468512 ,  1.98489163,  0.53827387,
         1.94597277,  0.90759772],
       [ 0.73141523,  0.43294657,  0.28515211, -0.5344004 ,  0.47287075,
        -0.5407336 , -0.76947082],
       [-0.16068847,  0.86265214, -1.31098811, -0.58986313, -1.16663766,
        -0.60518798, -0.64293243],
       [ 1.36987489, -0.08376038, -1.20690989, -0.0454475 , -1.07590457,
        -0.02492413, -0.97762873],
       [ 0.23733228, -0.99817197,  0.65636998, -0.54708994,  0.84558209,
        -0.52972043,  1.16473384],
       [ 0.06019157, -0.78770058, -1.19735701, -0.5706887 , -1.04352902,
        -0.58526678,  0.87793487],
       [ 1.18984935, -0.25485028, -1.15497786,  2.12621668, -1.05348987,
         2.24320671, -1.13475959],
       [ 0.13216168,  0.84256194,  1.41031142, -0.63874493,  1.67440929,
        -0.58952661, -0.71342998],
       [-0.8393307 , -1.200436  ,  0.37569195,  0.37534678,  0.47419514,
         0.36282493,  1.36099638]])


Plots

Let us first create some utility functions that will help us plot the graphs:


In [23]:
# Function that creates a DataFrame with a column for Cluster Number

def pd_centers(featuresUsed, centers):
    colNames = list(featuresUsed)
    colNames.append('prediction')

    # Append the cluster index to each center as a 'prediction' column
    Z = [np.append(A, index) for index, A in enumerate(centers)]

    # Convert to a pandas DataFrame for plotting
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P

In [47]:
# Function that creates Parallel Plots

def parallel_plot(data):
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    #print(my_colors)
    plt.figure(figsize=(15,8)).gca().axes.set_ylim([-3,+3])
    parallel_coordinates(data, 'prediction', color = my_colors, marker='o')

In [48]:
P = pd_centers(features, centers)
P


Out[48]:
air_pressure air_temp avg_wind_direction avg_wind_speed max_wind_direction max_wind_speed relative_humidity prediction
0 0.234143 0.320888 1.887940 -0.651746 -1.551791 -0.576631 -0.284151 0
1 -0.211016 0.633888 0.408613 0.733774 0.516811 0.671914 -0.151341 1
2 -0.696923 0.542163 0.177029 -0.584074 0.346284 -0.597456 -0.113546 2
3 -1.182352 -0.869483 0.446851 1.984892 0.538274 1.945973 0.907598 3
4 0.731415 0.432947 0.285152 -0.534400 0.472871 -0.540734 -0.769471 4
5 -0.160688 0.862652 -1.310988 -0.589863 -1.166638 -0.605188 -0.642932 5
6 1.369875 -0.083760 -1.206910 -0.045448 -1.075905 -0.024924 -0.977629 6
7 0.237332 -0.998172 0.656370 -0.547090 0.845582 -0.529720 1.164734 7
8 0.060192 -0.787701 -1.197357 -0.570689 -1.043529 -0.585267 0.877935 8
9 1.189849 -0.254850 -1.154978 2.126217 -1.053490 2.243207 -1.134760 9
10 0.132162 0.842562 1.410311 -0.638745 1.674409 -0.589527 -0.713430 10
11 -0.839331 -1.200436 0.375692 0.375347 0.474195 0.362825 1.360996 11

Dry Days


In [49]:
parallel_plot(P[P['relative_humidity'] < -0.5])


Warm Days


In [50]:
parallel_plot(P[P['air_temp'] > 0.5])


Cool Days


In [51]:
parallel_plot(P[(P['relative_humidity'] > 0.5) & (P['air_temp'] < 0.5)])
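Beyond the center profiles, it can also help to check how many samples landed in each cluster; on the fitted model above these assignments live in model.labels_. A minimal sketch of the idea on synthetic data (not the weather data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 40 points each.
rng = np.random.RandomState(0)
pts = np.vstack([rng.normal(loc=c, scale=0.2, size=(40, 2))
                 for c in [(0, 0), (4, 4)]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

# labels_ assigns each sample to a cluster; bincount gives cluster sizes.
sizes = np.bincount(km.labels_)
print(sizes)
```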